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Abstract 

Background: Inference of gene regulatory networks (GRNs) from gene microarray expression data is of great 
interest and remains a challenging task in systems biology. Despite many efforts to develop efficient computational 
methods, the successful modeling of GRNs thus far has been quite limited. To tackle this problem, we propose a 
novel framework to reconstruct radio-responsive GRNs based on the graphical lasso algorithm. In our attempt to 
study radiosensitivity, we reviewed the literature and analyzed two publicly available gene microarray datasets. The 
graphical lasso algorithm was applied to expression measurements for genes commonly found to be significant in 
these different analyses. 

Results: Assuming that a protein-protein interaction network obtained from a reliable pathway database is a gold- 
standard network, a comparison between the networks estimated by the graphical lasso algorithm and the gold- 
standard network was performed. Statistically significant p-values were achieved when comparing the gold- 
standard network with networks estimated from one microarray dataset and when comparing the networks 
estimated from two microarray datasets. 

Conclusion: Our results show the potential to identify new interactions between genes that are not present in a 
curated database and GRNs using microarray datasets via the graphical lasso algorithm. 



Background 

In recent years, there has been a great interest in identify- 
ing radio-responsive genes across the whole genome 
using gene microarray data in the field of radiation 
oncology. To develop new biomarkers for radiation expo- 
sure, Templin et al. used whole genome microarray and 
miRNA data generated from blood samples of patients 
who underwent total body irradiation in preparation for 
stem cell transplantation [1]. Rieger and Chu utilized oli- 
gonucleotide microarrays with cell lines collected from 
15 healthy individuals to investigate the transcriptional 
response of 10,000 genes in DNA damage to ionizing 
radiation (IR) and ultraviolet (UV) radiation [2]. In 
another study, Rieger et al. explored transcriptional 
responses to radiation in lymphoblastoid cells collected 
from 57 patients and found 20 IR-responsive and 4 UV 
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radiation-responsive genes predictive of radiation toxicity 
[3]. Eschrich et al. employed systems biology modeling to 
better understand radiosensitivity by identifying novel 
radiation-specific biomarkers [4]. With gene expression 
profiles from 48 human cancer cell lines, this biomarker 
discovery platform resulted in a key radiosensitivity net- 
work with 10 hub genes. In our previous study [5], we 
identified important radio-responsive genes using litera- 
ture review and gene microarray data analysis. With a 
systems biology approach, we found a core radio-respon- 
sive protein interaction network and its key biological 
processes using gene ontology analysis. 

Gene regulatory networks (GRNs) provide simplified 
representations and easy interpretation of biological pro- 
cesses in an organism under given conditions [6] . However, 
inference of GRNs remains a major challenging problem in 
systems biology, although a number of approaches have 
been proposed [7]. Cai et al. employed sparse structural 
equation models (SEMs) that integrate gene expression 
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and tis-expression quantitative trait loci data for improving 
inference accuracy and proposed a systematic inference 
method for SEM estimation [8]. Based on Bayesian analysis 
using a non-parametric Gaussian process modeling, a 
novel method for inferring GRNs was proposed by Aijo 
and Lahdesmaki [9]. This approach enables the use of both 
time-series and steady-state gene expression measurements 
to improve the inference of GRNs. Menendez et al. used a 
Gaussian Markov Random Field (GMRF) to deal with the 
problem of reverse engineering in GRNs and applied the 
graphical lasso algorithm to effectively estimate undirected 
graphical models [10]. Applying a lasso penalty to the pro- 
blem of inverse covariance matrix estimation facilitated a 
fast and efficient calculation. The graphical lasso algorithm, 
which uses a blockwise coordinate descent approach 
to estimate a sparse graphical model, was proposed by 
Friedman et al. [11]. 

In this study, we employed a multi-component filter- 
ing process, based on a systems biology approach that 
was proposed in our previous study [5], that narrows 
down the list of gene candidates and applied the graphi- 
cal lasso algorithm to microarray datasets to infer radio- 
responsive GRNs. To estimate the accuracy of the net- 
work modeling, the estimated networks were compared 
with a reliable protein interaction network produced by 
a manually curated protein interaction database. 

Methods 

Datasets 

In our previous study [5], we attempted to investigate all 
putative genes implicated in radiation response through 
literature review using PubMed and Scopus search engines 
and found a list of 221 genes associated with radiation 
response. In addition, to identify significant radio-respon- 
sive genes, biological processes, and pathways, further 
analysis was performed using two publicly available micro- 
array datasets (GSE23393 [1] and GSE1977 [2]) down- 
loaded from Gene Expression Omnibus (GEO) database. 
In GSE1977, lymphoblastoid cells collected from 15 
healthy individuals were exposed to 5-Gy radiation doses 
and harvested 4 hours later. To examine the change in 
gene expression after IR, only mock and IR cases were 
used. In GSE23393, blood was obtained from eight radio- 
therapy patients treated at our institution immediately 
before irradiation and at 4 hours after total body irradia- 
tion with 1.25-Gy X-rays. In this study, to explore GRNs 
associated with radiation response, we used the resulting 
information obtained from our literature review and 
reanalyzed the two microarray datasets. 

Identification of significant genes 

In the preprocessing stage, gene expression measurements 
were log-base-2 transformed, followed by quantile normal- 
ization across all samples. In our previous study [5], to 



identify significant genes that have considerable differen- 
tial expression between before and after irradiation, we 
used a permutation test. In this study, we employed Signif- 
icant Analysis of Microarrays (SAM) that resulted in a 
false discovery rate (FDR) and fold-change for each gene 
[12]. To minimize the possibility of omitting interesting 
genes, significant genes identified in the two different tech- 
niques were combined for further analysis. 

Pathway analysis 

Significant genes identified in our analysis were entered 
into a manually curated pathway database (MetaCore™, 
GeneGo, Inc.). This system leads to the most probable 
protein interaction network based on a list of genes 
uploaded by the user. We converted this resulting net- 
work to an undirected network and used it as a gold- 
standard network to assess the GRNs estimated by the 
graphical lasso algorithm. 

Graphical lasso algorithm 

In recent years, increasing attention has been paid to the 
estimation of sparse inverse covariance using L x (lasso) 
regularization [11,13,14]. This approach has been effi- 
ciently applied to the investigation of sparse undirected 
graphical models. In the graphical model, a node repre- 
sents a feature (gene or protein in this study) and an 
edge between two nodes represents the relationship 
between the two corresponding features. In particular, 
each nonzero off-diagonal element in the inverse covar- 
iance matrix indicates that there is a dependency 
between the two features. That is, as the number of 
zero off-diagonal elements in the inverse covariance 
matrix increases, sparser graphs are yielded. 

The underlying assumption behind the graphical lasso 
is that a data matrix X nxp consisting of p features mea- 
sured on n observations follows a multivariate Gaussian 
distribution with mean ^ and covariance E [10]. Let 
0 = X -1 be the precision matrix, and let S denote the 
covariance matrix of the data. The problem is to maxi- 
mize the penalized log-likelihood over nonnegative defi- 
nite matrices 0, taking the form 

log (det (0)) - trace (S0) — k || ©||i (1) 

where || 0||i is the L\ norm that is the sum of the abso- 
lute values of the elements of J -1 , and A is a nonnegative 
tuning parameter that controls the sparsity of the esti- 
mated network. More specifically, large values of A lead to 
sparse networks due to the lasso-type penalty, whereas 
small values of X lead to dense networks. Note that the 
problem that maximizes the penalized log-likelihood is 
convex [11]. The subgradient equation for Eq. (1) is 

-|-J(0)=W-S-A-sign(0) = O (2) 
30 
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where W = © The block coordinate descent 
approach partitions the rows and columns such that the 
target column is always the last, cycling through all fea- 
tures in turn. The partition of 0, W, and S is defined as: 
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where the size of On, Wu, and Su is (p - 1) x (p - 1); 
the size of 6i 2 , W12, and S12 is (p - 1) x 1; and # 2 2> ^22, 
and 522 are scalars. Inspired by this approach, Friedman 
et al. proposed to use a conventional lasso algorithm to 
solve Eq. (2). Here we describe how to solve Eq. (2). 
The upper-right block of Eq. (2) is 



wu -Si2-X- sign (0i 2 ) = 0. 
The lower-right block of Eq. (2) is 

U>22 — S22 — ^ = 0. 

Since w = 0 _1 > we nave 
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(3) 



(4) 



(5) 



from which we derive 

Wn0i2 + Wi2#22 = 0, 
w 12^12 + 1^22^22 = !■ 

Then, we have 
012 = —OuP, 

Oil = 1/(^22 - w[ 2J 6) 

where = W^/wn. Since 0 2 2 > 0,sign (0i 2 ) in Eq. (3) 
is equal to sign {-0 12 P) =-sign (p) . Therefore, Eq. (3) 
can be rewritten as 



We note that Eq. (6) is equivalent to the gradient equa- 
tion of the regular lasso problem. At each partition, a 
lasso regression is fitted. Then, W12 and W22 are calcu- 
lated and inserted into W before a new partition is made. 
This procedure is repeated until W is converged. Table 1 
summarizes the graphical lasso algorithm. 

Evaluation of estimated gene regulatory networks 

To compare the GRNs estimated using the graphical 
lasso algorithm with a gold-standard network produced 
by MetaCore software, we used Recall, Precision, and 
/-score metrics defined as follows [15]: 



Precision 



Recall 



TP 



TP + FP' 



TP 



TP + FN 



score = 



2 x Precision x Recall 
Precision + Recall 



W u ^-Si2+A-sign(^) = 0. 



where TP indicates the true positives (correctly 
inferred edges); FP represents the false positives (edges 
inferred in the estimated network, but not present in 
the gold-standard network); and FN signifies the false 
negatives (edges present in the gold-standard network, 
but not inferred). 

Results 

Identification of significant genes using microarray 
datasets 

Using SAM, significant genes were identified for the 
GSE1977 and GSE23393 datasets. For GSE1977, 61 genes 
were identified, with a cutoff of FDR < 20%, including 44 
induced genes and 17 repressed genes. For GSE23393, 
64 genes were identified, with a cutoff of FDR < 20%, 
including 19 induced genes and 45 repressed genes. Note 
that more genes were induced in GSE1977, whereas more 
genes were repressed in GSE23393. We examined com- 
monly found genes in the two datasets. Table 2 shows the 
21 overlapping genes with their fold-change and FDR. For 
these genes, considerable fold-changes were observed: 
averaged fold-changes for induced and repressed genes 
were 3.02 and 0.53, respectively, in GSE1977 and 3.12 and 

Table 1 Graphical lasso algorithm. 

1 . Initialize W = S + XI. Note that the diagonal of W is fixed in what follows. 

2. Repeat the following steps until W is converged. 

a. Partition the matrix W such that the target column is last. 

b. Solve the lasso problem using the coordinate descent algorithm. 

c. Update W12 = Wu/). 

3. In the final cycle, calculate 



(6) 012 = -P&22 With 6> 22 = 1/(^22 ~ w{ 2 /3> 
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Table 2 A list of 21 genes commonly identified in both microarray datasets using Significant Analysis of Microarrays. 

Gene Symbol Entrez Gene ID GSE1977 GSE23393 
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PLK2 


10769 
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3.63 


0.00 




PLK3 
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0.00 
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PPM1D 
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2.46 


0.00 


1.64 


0.00 




TNFRSF10B 
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2.33 


0.00 


2.46 


0.00 


Repressed Genes 


BIRC5 
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0.63 


16.09 


0.43 


3.52 




CCNB1 
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0.50 


0.00 




HUS1 
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0.60, respectively, in GSE23393. There was no significant 
difference in fold-change between GSE1977 and GSE23393 
(p = 0.64 using Wilcoxon signed-rank test). It is worthwhile 
to note that induced genes had relatively lower FDR than 
repressed genes in both datasets. 

In our previous study [5], we identified 20 genes that 
were significant in both datasets using a permutation 
test. It was found that 15 genes out of 20 were re-identi- 
fied in this study, with 5 genes (BTG2, CDKN1A, 
MDM2, MR1, and XPC) excluded in SAM analysis. To 
minimize the possibility of excluding important genes, 
we combined the two gene lists identified in the SAM 
analysis and the permutation test, resulting in a list of 
26 genes. Note that all 26 genes are in the list of 221 
genes found in our literature review [5]. 

Pathway analysis results 

These 26 genes were fed into MetaCore software to iden- 
tify a key interacting network. Figure 1 illustrates a directly 
connected protein-protein interaction network produced 
by MetaCore. This network consisted of 16 directly con- 
nected nodes and 28 edges. For the remaining 10 genes, 
there was no single connection. In this network, the MYC 
gene seems to play a key role as a hub gene with 12 con- 
nections. We assume that this network is reliable, consid- 
ering it as a gold-standard network to assess the accuracy 
of the GRNs estimated by the graphical lasso algorithm. 
Due to the static nature of the microarray datasets and the 



unidirectional property of GRNs resulting from the gra- 
phical lasso algorithm, we removed the directionality from 
the network in Figure 1 to ease the comparison between 
the unidirectional gold-standard network and the esti- 
mated GRNs. 

Identification of gene regulatory networks 

For the 16 genes shown in Figure 1, expression measure- 
ments after irradiation were extracted from the GSE1977 
and GSE23393 microarray datasets and were used in the 
graphical lasso algorithm. Using different k values, a large 
range of candidate networks were yielded. Table 3 
summarizes our experimental results. Table 3(A) and 
(B) show comparison results of networks estimated from 
GSE1977 and GSE23393 datasets, respectively, with the 
gold-standard network. To calculate a j?-value of a result 
obtained for each X in Table 3, a simulation was per- 
formed. For example, using — log 10 (A) = 1.2 in GSE1977, 
the estimated network had 27 edges with precision = 0.41, 
recall = 0.39, and /-score = 0.4. The common number of 
edges between the estimated network and the gold-stan- 
dard network was 11. For the simulation, we randomly 
created two networks, both having 16 nodes: one with 28 
edges (the number of edges in the gold-standard network) 
and the other with 27 edges, as in the estimated network. 
Then, a /-score was calculated between the two networks. 
This procedure was repeated 10,000 times. From our 
simulation results, the number of times that the /-score 
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Figure 1 A directly connected protein interaction network. This protein interaction network was obtained when 26 genes were entered into 
MetaCore software. This network consists of 16 nodes and 28 edges. The MYC gene has 12 connections. The table on the right shows corresponding 
gene symbols for proteins in the network. Red, green, and gray lines indicate inhibitory, stimulatory, and unspecified interactions, respectively. 



was larger than 0.4 (note that the /-score in the above 
example was 0.4) was 50. Therefore, its /?-value was 
50/10000 = 0.005. As shown in Table 3, the networks 
estimated from GSE1977 had larger /-scores than those 
estimated from GSE23393 and overall, their jj-values were 
statistically significant. 

Table 3(C) shows results of the comparison between the 
networks estimated from GSE1977 and GSE23393 data- 
sets. We note that the /-scores were greater than those 
between the gold-standard network and the networks esti- 
mated using GSE1977 or GSE23393. Their ^-values were 
statistically significant when — log 10 (l) > 1.1 in GSE1977 
and -log 10 (A.) > 0.8 in GSE23393. However, as 
— log 10 (A) values increased, the complexity of models (the 
number of edges in estimated networks) also increased. 
We compared the gold-standard network with estimated 
networks that have the number of edges that was similar 
to the gold-standard network (the number of edges: 28). 
In comparison between the gold-standard network and 
networks estimated from GSE1977 with — log 10 (A) = 1.2, 
the number of edges was 27, /-score was 0.4, and j?-value 



was 0.005. In comparison between the gold-standard net- 
work and networks estimated from GSE23393 with 
— log 10 (A) = 1.04, the number of edges was 30, /-score 
was 0.24, and /"-value was 0.337. In a comparison of 
the two networks estimated using GSE1977 (with 
-log 10 (;g = 1.2) and GSE23393 (with -log 10 (A.) = 1.04), 
the /-score was 0.42 and /rvalue was 0.004. Figure 2 shows 
the change in /-scores calculated from two networks esti- 
mated using GSE1977 and GSE23393 with different A. 
values. Figure 3 illustrates the change in networks esti- 
mated when the graphical lasso algorithm was applied to 
GSE1977 with 6 different X values. 

Discussion 

To identify radio-responsive GRNs, we employed the gra- 
phical lasso algorithm using a list of radio-responsive 
genes selected from literature review and microarray data 
analysis. For the identification of significant genes with 
two publicly available microarray datasets (GSE1977 and 
GSE23393), we used SAM analysis. As a result, we found 
21 genes to be important with FDR < 20%. These 21 genes 
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Table 3 Results obtained using the graphical lasso algorithm. 
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(A) Comparison of the gold-standard network and networks estimated from GSE1977, 
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(C) comparison of networks estimated from GSE1977 and GSE23393. 











included 15 of 20 genes that we identified in our previous 
study using a permutation test. This result suggests that 
we may miss important genes by using only a single statis- 
tical approach. In this study, we combined the two gene 
sets for further analysis, resulting in a list of 26 genes, 
including genes related to DNA repair: GADD45A, XPC, 
DDB2, and PCNA [2]. 

These 26 genes were entered into MetaCore software 
for a systems biology analysis. We identified a protein- 
protein interaction network that was used as a gold- 
standard network in this study. This network consisted 
of 16 nodes and 28 edges. Interestingly, the MYC gene 
had 12 connections, implying that this gene may play an 
important role in DNA-damage-related biological 



functions. It has been known that MYC is a key regula- 
tor of cell proliferation and apoptosis [16-19]. However, 
the role of MYC is not fully understood, due to its con- 
tradictory effect in enhancing or reducing radiorespon- 
siveness [20,21]. 

The /-scores calculated to compare the gold-standard 
network with the networks estimated from GSE1977 
were larger than those found in comparing the gold- 
standard network with the networks estimated from 
GSE23393. It was noted that the /-scores calculated 
from the two networks estimated from GSE1977 and 
GSE23393 were larger than those calculated from the 
gold-standard network and the networks estimated from 
GSE1977. This means that there are some common 
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Figure 2 Accuracy of estimated networks. The f-scores calculated from two networks estimated using GSE1977 and GSE23393 with different X values. 




Figure 3 Change in estimated networks. Networks estimated when the graphical lasso algorithm was applied to GSE1977 with 6 different X values. 
* 
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edges between the two networks estimated from 
GSE1977 and GSE23393, which are not present in the 
gold-standard network, suggesting that these edges 
could represent unknown associations between the 
genes. 

In this study, we compared the estimated networks 
with a gold- standard network to investigate the change 
in the accuracy and number of interactions between 
genes with different X values. However, in general, one 
needs to find the best regularization parameter X, taking 
into account a tradeoff between model prediction and 
model complexity [10,22]. Bayesian information criterion 
(BIC) and Akaike criterion (AIC) are widely used for 
model selection. 

Despite the lack of available datasets regarding radia- 
tion response, we demonstrated the presented metho- 
dology has the potential to identify radio-responsive 
GRNs via the graphical lasso algorithm based on litera- 
ture review and microarray data analysis. 

Conclusions 

We demonstrated that the graphical lasso algorithm can 
be a useful tool to reconstruct GRNs. We used a biomar- 
ker filtering method proposed in our previous study based 
on literature review and microarray data analysis. To eval- 
uate the accuracy of radio-responsive GRNs estimated 
from two publicly available microarray datasets, we used a 
reliable protein interaction network generated from a 
curated database as a gold-standard network. It is expected 
that our proposed method can be efficiently used to iden- 
tify not only significant radio-responsive genes, but also 
radio-responsive GRNs. 
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